{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multi-Image Generation\n",
    "\n",
    "In this example, you will learn how to generate text from multiple images using the supported models: `Qwen2-VL`, `Pixtral`, and `llava-interleave`.\n",
    "\n",
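    "Multi-image generation lets you pass a list of images to the model and generate text conditioned on all of them.\n",
    "\n",
    "Each section below follows the same three steps: load a model and its processor, build a prompt with `apply_chat_template` (passing `num_images` so the template inserts one image placeholder per image), and call `generate` with the prompt and the list of images.\n"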
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mlx_vlm import load, apply_chat_template, generate\n",
    "from mlx_vlm.utils import load_image, process_image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
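    "# Paths to the two input images; the same list is reused for every model below\n",
    "images = [\"images/cats.jpg\", \"images/desktop_setup.png\"]\n",
    "\n",
    "# A single user turn; the image placeholders are added later by apply_chat_template\n",
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"Describe what you see in the images.\"}\n",
    "]"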
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Qwen2-VL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load model and processor\n",
    "qwen_vl_model, qwen_vl_processor = load(\"mlx-community/Qwen2-VL-7B-Instruct-4bit\")\n",
    "qwen_vl_config = qwen_vl_model.config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = apply_chat_template(qwen_vl_processor, qwen_vl_config, messages, num_images=len(images))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==========\n",
      "Image: ['images/cats.jpg', 'images/desktop_setup.png'] \n",
      "\n",
      "Prompt: <|im_start|>system\n",
      "You are a helpful assistant.<|im_end|>\n",
      "<|im_start|>user\n",
      "Describe what you see in the images.<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>\n",
      "<|im_start|>assistant\n",
      "\n",
      "The image shows a cozy home office setup with a pink blanket covering the desk and chair. There are two cats lounging on the blanket, one on the left side and the other on the right side of the desk. The desk has a computer monitor, keyboard, and mouse. There are also speakers on either side of the monitor, a remote control, and a small plant on the left side of the desk. The wall behind the desk has a framed text that says, \"Don't grow up, it's a trap.\" The overall scene is playful and relaxed, with the cats adding a touch of whimsy to the workspace.\n",
      "==========\n",
      "Prompt: 10.047 tokens-per-sec\n",
      "Generation: 28.392 tokens-per-sec\n"
     ]
    }
   ],
   "source": [
    "qwen_vl_output = generate(\n",
    "    qwen_vl_model,\n",
    "    qwen_vl_processor,\n",
    "    prompt,\n",
    "    images,\n",
    "    max_tokens=1000,\n",
    "    temperature=0.7,\n",
    "    verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pixtral"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load model and processor\n",
    "pixtral_model, pixtral_processor = load(\"mlx-community/pixtral-12b-4bit\")\n",
    "pixtral_config = pixtral_model.config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = apply_chat_template(pixtral_processor, pixtral_config, messages, num_images=len(images))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
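    "# Pixtral requires the images to be resized before multi-image generation.\n",
    "# Each image is scaled to fit a 560x560 target; as the output of the next cell shows,\n",
    "# the aspect ratio is kept, so the resulting sizes can differ (560x420 and 560x347 here).\n",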
    "resized_images = [process_image(load_image(image), (560, 560), None) for image in images]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==========\n",
      "Image: [<PIL.Image.Image image mode=RGB size=560x420 at 0x3ACDD5FF0>, <PIL.Image.Image image mode=RGB size=560x347 at 0x39D697820>] \n",
      "\n",
      "Prompt: <s>[INST]Describe what you see in the images.[IMG][IMG][/INST]\n",
      "The first image shows two cats lying on a pink couch. One cat is on the left side, and the other is on the right side. Both cats appear to be relaxed and comfortable. There are two remote controls on the couch, one near each cat. The background of the image is a plain, light-colored wall.\n",
      "\n",
      "The second image depicts a home office setup. The main elements include:\n",
      "\n",
      "1. A wooden desk with a computer monitor in the center.\n",
      "2. Two black speakers on either side of the monitor.\n",
      "3. A black office chair in front of the desk.\n",
      "4. A wooden shelf on the left side of the desk, holding various items including records and a potted plant.\n",
      "5. A framed poster on the wall behind the desk with the text \"Don't Grow Up, It's a Trap.\"\n",
      "6. A drum set on the right side of the image, partially visible.\n",
      "\n",
      "The overall setting of the second image is a modern, minimalist workspace with personal touches.\n",
      "==========\n",
      "Prompt: 2.203 tokens-per-sec\n",
      "Generation: 24.366 tokens-per-sec\n"
     ]
    }
   ],
   "source": [
    "pixtral_output = generate(\n",
    "    pixtral_model,\n",
    "    pixtral_processor,\n",
    "    prompt,\n",
    "    resized_images,\n",
    "    max_tokens=1000,\n",
    "    temperature=0.7,\n",
    "    verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LLaVA-Interleave"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load model and processor\n",
    "llava_model, llava_processor = load(\"mlx-community/llava-interleave-qwen-0.5b-bf16\")\n",
    "llava_config = llava_model.config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = apply_chat_template(llava_processor, llava_config, messages, num_images=len(images))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==========\n",
      "Image: ['images/cats.jpg', 'images/desktop_setup.png'] \n",
      "\n",
      "Prompt: <|im_start|>user\n",
      "<image><image>\n",
      "Describe what you see in the images.<|im_end|>\n",
      "<|im_start|>assistant\n",
      "\n",
      "The image captures a cozy scene in a room. Two cats, one gray and the other brown and white, are lying on a pink couch. The gray cat is resting its head on the back of the couch, while the brown and white cat is lying on its side. They are both facing the camera, their relaxed postures suggesting a sense of comfort and tranquility.\n",
      "\n",
      "The room itself is a study in simplicity. A whiteboard hangs on the wall, a black computer monitor sits on a wooden desk, and a black speaker stands tall on a shelf. The walls are painted a light pink, providing a warm and inviting backdrop to the scene.\n",
      "\n",
      "On the desk, there's a black keyboard and a white mouse, ready for use. A plant sits on a small table next to the desk, adding a touch of nature to the room. A whiteboard eraser is also present on the desk, perhaps used for cleaning or marking.\n",
      "\n",
      "Overall, the image paints a picture of a comfortable and well-organized living space, where the cats have found their place and are enjoying their time.\n",
      "==========\n",
      "Prompt: 28.728 tokens-per-sec\n",
      "Generation: 52.317 tokens-per-sec\n"
     ]
    }
   ],
   "source": [
    "llava_output = generate(\n",
    "    llava_model,\n",
    "    llava_processor,\n",
    "    prompt,\n",
    "    images,\n",
    "    max_tokens=1000,\n",
    "    temperature=0.7,\n",
    "    verbose=True\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mlx_code",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
